Part A
1. Please refer to the table below to answer the following questions: [2 Marks]
| Planned to purchase Product A | Actually placed order: Yes | Actually placed order: No | Total |
|---|---|---|---|
| Yes | 400 | 100 | 500 |
| No | 200 | 1300 | 1500 |
| Total | 600 | 1400 | 2000 |
Out of the 2,000 people surveyed, 500 planned to purchase Product A.
1A. Referring to the table above, find the joint probability that a person planned to purchase and actually placed an order.
# 1A
actual = 400
total = 2000
jp = round(actual/total,4)
print('joint probability is %1.4f' %jp)
1B. Referring to the table above, find the conditional probability that a person actually placed an order, given that they planned to purchase.
pp = 500  # total who planned to purchase
Planned_purchase = (actual/total)/ (pp/total)  # P(ordered | planned)
print('Answer is %1.4f' % Planned_purchase)
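The same two probabilities can also be derived directly from a contingency table rather than hard-coded counts; a small sketch (the counts are copied from the table above):

```python
import pandas as pd

# Contingency table of planned vs actual purchase (counts from the table above)
table = pd.DataFrame({'Ordered': [400, 200], 'NotOrdered': [100, 1300]},
                     index=['Planned', 'NotPlanned'])
grand_total = table.values.sum()                               # 2000
joint = table.loc['Planned', 'Ordered'] / grand_total          # P(planned and ordered)
cond = table.loc['Planned', 'Ordered'] / table.loc['Planned'].sum()  # P(ordered | planned)
print(round(joint, 4), round(cond, 4))  # 0.2 0.8
```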
2. An electrical manufacturing company conducts quality checks at specified periods on the products it manufactures. Historically, the failure rate for the manufactured item is 5%. Suppose a random sample of 10 manufactured items is selected. Answer the following questions. [4 Marks]
import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
failure_rate = 0.05
sample_size = 10
temp = np.arange(0,11)
ans = stats.binom.pmf(temp,sample_size,failure_rate)
print(ans)
2.A. Probability that none of the items are defective? [1 Mark]
print('Probability that none of the items are defective is %1.4f' %ans[0])
2.B. Probability that exactly one of the items is defective? [1 Mark]
print('Probability that exactly one of the items is defective is %1.4f' %ans[1])
2.C. Probability that two or fewer of the items are defective? [1 Mark]
# Calculation of cumulative probability
cumans = stats.binom.cdf(temp,sample_size,failure_rate)
print(cumans)
print('Probability that two or fewer of the items are defective is %1.4f' %cumans[2])
2.D. Probability that three or more of the items are defective? [1 Mark]
p_ans = 1 - cumans[2]
print('Probability that three or more of the items are defective is %1.4f' % p_ans)
3. A car salesman sells on an average 3 cars per week. [3 Marks]
3.A. What is Probability that in a given week he will sell some cars? [1 Mark]
mean = 3
net = np.arange(0,16)
poisson_law = stats.poisson.cdf(net,mean)
poisson_law
sell_some = 1 - poisson_law[0]
print('Probability that in a given week he will sell some cars is %1.4f' % sell_some)
3.B. What is Probability that in a given week he will sell 2 or more but less than 5 cars? [1 Mark]
prob1 = poisson_law[4]
prob2 = poisson_law[1]
sale = prob1 - prob2
print('Probability that in a given week he will sell 2 or more but less than 5 cars is %1.4f' % sale)
3.C. Plot the poisson distribution function for cumulative probability of cars sold per-week vs number of cars sold per week. [1 Mark]
plot_poisson = stats.poisson.cdf(net,mean)  # cumulative probability, as the question requires
plt.plot(net,plot_poisson, 'o-')
plt.title('Poisson')
plt.xlabel('Number of cars sold per week')
plt.ylabel('Cumulative probability of cars sold per-week')
plt.show()
4.A. What is the probability that all three orders will be recognised correctly? [1 Mark]
# Binomial model: n = 3 orders, p = 0.868 (probability an order is recognised correctly)
import scipy.stats as st
p=0.868
q=np.arange(0,4)
n=3
array=stats.binom.pmf(q,n,p)
print("The probability that ALL THREE orders will be recognised correctly is: %1.4f"%array[3])
4.B. What is the probability that none of the three orders will be recognised correctly? [1 Mark]
result1=st.binom.pmf(0,3,p)
print("The probability that NONE of the three orders will be recognised correctly is: %1.4f"%result1)
4.C. What is the probability that at least two of the three orders will be recognised correctly? [1 Mark]
result2=st.binom.sf(1,3,p)
print("The probability that at least TWO of the three orders will be recognised correctly is: %1.4f"%result2)
- Explain one real-life industry scenario (other than the ones mentioned above) where you can use the concepts learnt in this module of Applied Statistics to arrive at a data-driven business solution. [3 Marks]
Answer: The concepts above cover the basics of probability and discrete and continuous probability distributions. They are therefore well suited to forecasting results in stock markets and share exchanges, for example estimating the probability of a large market fall over a given period.
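As a small illustration of the scenario above (all numbers hypothetical): if daily index returns are assumed approximately normal, the probability of a one-day fall of more than 5% can be read straight off the normal CDF:

```python
from scipy import stats

# Hypothetical parameters, as if estimated from historical daily returns
mean_return = 0.0005   # average daily return (0.05%)
std_return = 0.012     # daily volatility (1.2%)

# P(daily return < -5%) under the normal assumption
p_crash = stats.norm.cdf(-0.05, loc=mean_return, scale=std_return)
print('P(one-day fall of more than 5%%) = %.2e' % p_crash)
```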
Part B
- Read the data set, clean the data and prepare final dataset to be used for analysis. [10 Marks]
#Importing all the packages
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
import seaborn as sns
import pandas as pd
#Information relating to the dataset
dataset = pd.read_csv('DS+-+Part2+-+Basketball.csv')
dataset.head()
dataset.describe(include='all')
dataset.info()
dataset.shape
dataset.tail()
RangeIndex: 61 entries, 0 to 60
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Team 61 non-null object
1 Tournament 61 non-null int64
2 Score 61 non-null object
3 PlayedGames 61 non-null object
4 WonGames 61 non-null object
5 DrawnGames 61 non-null object
6 LostGames 61 non-null object
7 BasketScored 61 non-null object
8 BasketGiven 61 non-null object
9 TournamentChampion 61 non-null object
10 Runner-up 61 non-null object
11 TeamLaunch 61 non-null object
12 HighestPositionHeld 61 non-null int64
dtypes: int64(2), object(11)
memory usage: 6.3+ KB
#Cleaning the data values
filter_data = dataset.isnull().sum()
print(filter_data[0:13])
dataset= dataset.replace('-',0)
dataset = dataset.replace(to_replace ='to', value = '-', regex = True)
dataset = dataset.replace(to_replace ='_', value = '-', regex = True)
dataset = dataset.replace(to_replace ='~', value = '-', regex = True)
dataset.head(10)
Tournament 0
Score 0
PlayedGames 0
WonGames 0
DrawnGames 0
LostGames 0
BasketScored 0
BasketGiven 0
TournamentChampion 0
Runner-up 0
TeamLaunch 0
HighestPositionHeld 0
dtype: int64
#Converting the data type to int for further operations
import re
length=len(dataset['TeamLaunch'])
for tl in range(length):
    x = dataset.loc[tl,"TeamLaunch"]
    t = re.findall(r'\d{4}', x)          # all four-digit years in the entry
    dataset.loc[tl,"TeamLaunch"] = t[0]  # keep the launch year
dataset['TeamLaunch'] = dataset['TeamLaunch'].astype(int)
dataset.head(10)
dataset_new = dataset.loc[:,dataset.columns !='Team']
dataset_new = dataset_new.astype(int)
dataset_new.info()
RangeIndex: 61 entries, 0 to 60
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Tournament 61 non-null int64
1 Score 61 non-null int64
2 PlayedGames 61 non-null int64
3 WonGames 61 non-null int64
4 DrawnGames 61 non-null int64
5 LostGames 61 non-null int64
6 BasketScored 61 non-null int64
7 BasketGiven 61 non-null int64
8 TournamentChampion 61 non-null int64
9 Runner-up 61 non-null int64
10 TeamLaunch 61 non-null int64
11 HighestPositionHeld 61 non-null int64
dtypes: int64(12)
memory usage: 5.8 KB
Q2. Perform a detailed statistical analysis and EDA using univariate, bi-variate and multivariate techniques to derive data-driven insights on which teams the company should approach for a winning deal. As a data and statistics expert, also develop a detailed performance report from this data.
#Appending 'Team' column to the first position
dataset_new['Team']=(dataset['Team'])
col_first =dataset_new.pop('Team')
dataset_new.insert(0, 'Team', col_first)
dataset_new.head()
#Teams who played the maximum no. of tournaments
dataset_new.groupby(['Team'])['Tournament'].sum().sort_values(ascending=False).head(10)
#Teams who played the minimum no. of tournaments
dataset_new.groupby(['Team'])['Tournament'].sum().sort_values(ascending=True).head(10)
#Championship ratio: championships won per tournament played
dataset_new['Champions'] = dataset_new['TournamentChampion']/dataset_new['Tournament']
dataset_new['Champions'].dtype
Champions = dataset_new['Champions'].sort_values(ascending =False).head(10)
print('The top five Champion teams are given below')
dataset_new.groupby(['Team'])['Champions'].sum().sort_values(ascending =False).head(5)
#Win ratio: games won as a fraction of games played
dataset_new['WinResult'] = dataset_new['WonGames']/dataset_new['PlayedGames']
dataset_new['WinResult'].dtype
Top_five = dataset_new['WinResult'].sort_values(ascending =False).head(10)
print('The top five teams with the highest win ratio are given below')
dataset_new.groupby(['Team'])['WinResult'].sum().sort_values(ascending =False).head(5)
#Calculation showing insertion of a new column for finding the age of team with reference to the year 2022 and oldest team
dataset_new['Age_of_team'] = 2022-dataset_new['TeamLaunch']
print('Age of the oldest team is: ',dataset_new['Age_of_team'].max())
print('Oldest teams in the tournaments are given below ')
print(dataset_new[dataset_new['Age_of_team'] == dataset_new['Age_of_team'].max()])
Oldest teams in the tournaments are given below
Team Tournament Score PlayedGames WonGames DrawnGames LostGames \
0 Team 1 86 4385 2762 1647 552 563
1 Team 2 86 4262 2762 1581 573 608
2 Team 3 80 3442 2614 1241 598 775
4 Team 5 86 3368 2762 1209 633 920
6 Team 7 82 2792 2626 948 608 1070
7 Team 8 70 2573 2302 864 577 861
13 Team 14 44 1416 1428 453 336 639
44 Team 45 7 107 130 43 21 66
51 Team 52 4 56 72 21 14 37
53 Team 54 3 42 54 18 6 30
BasketScored BasketGiven TournamentChampion Runner-up TeamLaunch \
0 5947 3140 33 23 1929
1 5900 3114 25 25 1929
2 4534 3309 10 8 1929
4 4631 3700 8 7 1929
6 3609 3889 0 0 1929
7 3228 3230 2 3 1929
13 1843 2368 0 1 1929
44 227 308 0 0 1929
51 153 184 0 0 1929
53 97 131 0 0 1929
HighestPositionHeld Champions WinResult Age_of_team
0 1 0.383721 0.596307 93
1 1 0.290698 0.572411 93
2 1 0.125000 0.474751 93
4 1 0.093023 0.437726 93
6 3 0.000000 0.361005 93
7 1 0.028571 0.375326 93
13 2 0.000000 0.317227 93
44 3 0.000000 0.330769 93
51 6 0.000000 0.291667 93
53 8 0.000000 0.333333 93
#Printing the newest team in the tournament and the age
print('Age of the newest team is: ',dataset_new['Age_of_team'].min())
print('Newest teams in the tournaments are given below ')
print(dataset_new[dataset_new['Age_of_team'] == dataset_new['Age_of_team'].min()])
Newest teams in the tournaments are given below
Team Tournament Score PlayedGames WonGames DrawnGames LostGames \
60 Team 61 1 0 0 0 0 0
BasketScored BasketGiven TournamentChampion Runner-up TeamLaunch \
60 0 0 0 0 2017
HighestPositionHeld Champions WinResult Age_of_team
60 9 0.0 NaN 5
#Team experience: years of existence per tournament played
dataset_new['TeamExp'] = dataset_new['Age_of_team']/dataset_new['Tournament']
dataset_new['TeamExp'].dtype
print('Five teams with the most years per tournament played are given below:')
dataset_new.groupby(['Team'])['TeamExp'].sum().sort_values(ascending =False).head(5)
#Teams with the fewest years per tournament played
print('Five teams with the fewest years per tournament played are given below:')
dataset_new.groupby(['Team'])['TeamExp'].sum().sort_values(ascending =True).head(5)
# % of championships w.r.t. experience
dataset_new['TeamPerformance'] = dataset_new['Champions']/dataset_new['TeamExp']
dataset_new['TeamPerformance'].dtype
print('Top five teams with percentage of championship with respect to experience are given below')
dataset_new.groupby(['Team'])['TeamPerformance'].sum().sort_values(ascending =False).head(5)
#Filter the dataframe
df1=dataset_new.filter(items=['Team', 'Age_of_team', 'TeamExp','TeamPerformance','WinResult','Champions'])
df1
#Plot bar graph between Age of Team and Team Performance
plt.figure(figsize = (16,16))
sns.barplot(x='Age_of_team', y='TeamPerformance',data=df1);
Comment: From the above plot we can conclude that the teams aged 93 years (Teams 1, 2 and 3) perform better than the others.
#Total win ratio by team age (top 10 age groups)
df2=dataset_new.groupby(['Age_of_team'])['WinResult'].sum().sort_values(ascending =False).head(10)
plt.figure(figsize = (10,10))
sns.barplot(y = df2.values ,x = df2.index);
#Correlation between all the metrics
correl=df1.corr()
correl
#heatmap of the data
sns.heatmap(correl, annot = True);
Comment: The heatmap of the correlation table shows a very high correlation between "TeamPerformance" and "Champions".
#Total championships by team age (top 10 age groups)
df3=dataset_new.groupby(['Age_of_team'])['Champions'].sum().sort_values(ascending =False).head(10)
plt.figure(figsize = (10,10))
sns.barplot(y = df3.values ,x = df3.index);
#Scatterplot to compare new vs old teams wrt "Age" and "HighestPositionHeld" (Basis: Championship)
sns.scatterplot(data=dataset_new, x="Champions", y="HighestPositionHeld",size="Age_of_team");
#Scatterplot to compare new vs old teams wrt "Age" and "HighestPositionHeld" (Basis: WinResult)
sns.scatterplot(data=dataset_new, x="WinResult", y="HighestPositionHeld",size="Age_of_team");
Comment: From the above plot we can conclude that the team aged 15 years held the highest position.
sns.lmplot(x="WinResult", y="HighestPositionHeld", hue = "Team", data=dataset_new);
Comment: From the above plot we can conclude that Team 57 held the highest position.
import seaborn as sns
Tournament_mean = dataset['Tournament'].mean()
Tournament_median = dataset['Tournament'].median()
Tournament_mode = dataset['Tournament'].mode()
print('The Mean: ',Tournament_mean,'\nThe Median: ',Tournament_median,'\nThe Mode: ',Tournament_mode[0])
fig, ax_hist = plt.subplots(figsize = (12.8, 6))
ax_hist = sns.distplot(dataset['Tournament'])
ax_hist.axvline(Tournament_mean, color = 'r', linestyle = '--', label = 'Mean')
ax_hist.axvline(Tournament_median, color = 'g', linestyle = '-', label = 'Median')
ax_hist.axvline(Tournament_mode[0], color = 'b', linestyle = '-', label = 'Mode')
ax_hist.set_title('3Ms and Distribution of Tournament')
plt.legend(); plt.show()
The Median: 12.0
The Mode: 1
HighestPositionHeld_mean = dataset['HighestPositionHeld'].mean()
HighestPositionHeld_median = dataset['HighestPositionHeld'].median()
HighestPositionHeld_mode = dataset['HighestPositionHeld'].mode()
print('The Mean: ',HighestPositionHeld_mean,'\nThe Median: ',HighestPositionHeld_median,'\nThe Mode: ',HighestPositionHeld_mode[0])
fig, ax_hist = plt.subplots(figsize = (12.8, 6))
ax_hist = sns.distplot(dataset['HighestPositionHeld'])
ax_hist.axvline(HighestPositionHeld_mean, color = 'r', linestyle = '--', label = 'Mean')
ax_hist.axvline(HighestPositionHeld_median, color = 'g', linestyle = '-', label = 'Median')
ax_hist.axvline(HighestPositionHeld_mode[0], color = 'b', linestyle = '-', label = 'Mode')
ax_hist.set_title('3Ms and Distribution of HighestPositionHeld')
plt.legend(); plt.show()
The Median: 6.0
The Mode: 1
df4 = df1.sort_values(by='WinResult',ascending=False)
print('The Top Old Teams are shown in Table 1\nTABLE 1')
df4.sort_values(by='Age_of_team',ascending=False).head(5)
TABLE 1
df5 = df1.sort_values(by='TeamExp',ascending=False) # Top performing teams
print('The Top Old Teams are shown in Table 2\nTABLE 2')
df5.sort_values(by='Age_of_team',ascending=False).head(5)
TABLE 2
print('The Top Young Teams are shown in Table 1A\nTABLE 1A')
df4.sort_values(by='Age_of_team',ascending=True).head(5)
TABLE 1A
print('The Top Young Teams are shown in Table 1B\nTABLE 1B')
df5.sort_values(by='Age_of_team',ascending=True).head(5)
TABLE 1B
#Interquartile range (IQR) of the numeric columns
dataset.quantile(0.75, numeric_only=True) - dataset.quantile(0.25, numeric_only=True)
df=pd.DataFrame(dataset)
df
df_num = df.select_dtypes(include='number')
df_num
#print range
print(df_num.max() - df_num.min())
TeamLaunch 88
HighestPositionHeld 19
dtype: int64
#print variance
print(dataset.var(numeric_only=True))
TeamLaunch 755.376503
HighestPositionHeld 27.843169
dtype: float64
#print standard deviation
print(dataset.std(numeric_only=True))
TeamLaunch 27.484114
HighestPositionHeld 5.276663
dtype: float64
#To check the covariance
dataset.cov(numeric_only=True)
import seaborn as sns
sns.pairplot(dataset, kind="reg")
plt.show()
#Cumulative Distribution
fdataset = df.drop('Team',axis=1).head(1)
sns.distplot(fdataset, hist_kws=dict(cumulative=True), kde_kws=dict(cumulative=True));
#Creating Scatter Plot
import seaborn as sns
import pandas as pd
dataset = pd.read_csv('DS+-+Part2+-+Basketball.csv')
fdata=dataset.dropna()
sns.pairplot(fdata)
sns.scatterplot(data=fdata, x='Tournament', y='HighestPositionHeld')
print("Skewness of 'Tournament': {}\n".format(dataset['Tournament'].skew().round(3)))
print("Skewness of 'HighestPositionHeld': {}\n".format(dataset['HighestPositionHeld'].skew().round(3)))
Skewness of 'HighestPositionHeld': 0.818
UNIVARIATE ANALYSIS USING DATA VISUALIZATION TECHNIQUES
attributes = ['Tournament','Score','PlayedGames','WonGames','DrawnGames','LostGames','BasketScored','BasketGiven','TournamentChampion','Runner-up','TeamLaunch','HighestPositionHeld']
plt.figure(figsize=(20,20))
i = 1
for colm in df[attributes]:
    plt.subplot(4,3,i)
    sns.distplot(df[colm],kde=True)
    i += 1
#Total no. of matches played by teams
plt.figure(figsize=(50,30))
plt.xlim(0,180)
sns.barplot(x=df['Team'], y=df['Tournament']);
MULTIVARIATE ANALYSIS USING DATA VISUALIZATION
plt.figure(figsize=(50,50))
sns.pairplot(df[attributes])
Part B 3. Please include any improvements or suggestions to the association management on the quality, quantity, variety, velocity, veracity etc. of the data points collected by the association, so that better data analysis can be performed in future. At least one suggestion for each point. [10 Marks]
Answer: After assessing the overall data, our suggestion to the association management is to improve the data points collected, with respect to quality, quantity, variety, velocity and veracity, by additionally capturing:
* The teams which are already in contract with competitors.
* The particular year for which a certain decision is to be found out.
* The notable top players playing in the team.
PART- C
- Read the CSV file.
import pandas as pd
data = pd.read_csv('DS+-+Part3+-+CompanyX_EU.csv')
data.head()
data.tail()
data.shape
data.info()
RangeIndex: 662 entries, 0 to 661
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Startup 662 non-null object
1 Product 656 non-null object
2 Funding 448 non-null object
3 Event 662 non-null object
4 Result 662 non-null object
5 OperatingState 662 non-null object
dtypes: object(6)
memory usage: 31.2+ KB
- Data Exploration: [1 Mark] A. Check the datatypes of each attribute.
# Creating the DataFrame
df = pd.DataFrame(data)
print(df)
0 2600Hz 2600hz.com NaN Disrupt SF 2013
1 3DLT 3dlt.com $630K Disrupt NYC 2013
2 3DPrinterOS 3dprinteros.com NaN Disrupt SF 2016
3 3Dprintler 3dprintler.com $1M Disrupt NY 2016
4 42 Technologies 42technologies.com NaN Disrupt NYC 2013
.. ... ... ... ...
657 Zivity zivity.com $8M TC40 2007
658 Zmorph zmorph3d.com $1M -
659 Zocdoc zocdoc.com $223M TC40 2007
660 Zula zulaapp.com $3.4M Disrupt SF 2013
661 Zumper zumper.com $31.5M Disrupt SF 2012
Result OperatingState
0 Contestant Operating
1 Contestant Closed
2 Contestant Operating
3 Audience choice Operating
4 Contestant Operating
.. ... ...
657 Contestant Operating
658 Audience choice Operating
659 Contestant Operating
660 Audience choice Operating
661 Finalist Operating
[662 rows x 6 columns]
datatypes = df.dtypes
print(datatypes)
Product object
Funding object
Event object
Result object
OperatingState object
dtype: object
B. Check for null values in the attributes.
import pandas as pd
df = pd.DataFrame(data)
print(df.isnull().any())
Product True
Funding True
Event False
Result False
OperatingState False
dtype: bool
- Data preprocessing & visualisation: [4 Marks] A. Drop the null values. [1 Mark]
#Null Values are dropped
cleandata = data.dropna()
cleandata.isnull().any()
3.B. Convert the ‘Funding’ feature to a numerical value. [1 Mark]
df1 = cleandata.copy()
#Strip the leading '$' and normalise K/M/B suffixes to millions
df1.loc[:,'Funds_in_million'] = df1['Funding'].apply(lambda x: float(x[1:-1])/1000 if x[-1] == 'K' else (float(x[1:-1])*1000 if x[-1] == 'B' else float(x[1:-1])))
datatypes = df1.dtypes
print(datatypes)
df1.head()
Product object
Funding object
Event object
Result object
OperatingState object
Funds_in_million float64
dtype: object
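A quick sanity check of the suffix rule used in the lambda above, on sample strings (assuming every `Funding` value starts with '$' and ends in K, M or B):

```python
# Same conversion rule as the lambda above, written out; output is in millions
def to_millions(x):
    if x[-1] == 'K':
        return float(x[1:-1]) / 1000
    if x[-1] == 'B':
        return float(x[1:-1]) * 1000
    return float(x[1:-1])  # 'M' suffix

print(to_millions('$630K'))   # 0.63
print(to_millions('$1M'))     # 1.0
print(to_millions('$31.5M'))  # 31.5
print(to_millions('$2B'))     # 2000.0
```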
- C. Plot box plot for funds in million. [1 Mark]
plt.figure(figsize=(40,20))
sns.boxplot(data=df1,x='Funds_in_million');
plt.semilogx()
plt.show()
C.3.D. Check the number of outliers greater than the upper fence. [1 Mark]
funds = df1['Funds_in_million']
funds
Q1,Q3 = np.percentile(funds,[25,75])
IQR = Q3 - Q1
UOUT = df1[df1['Funds_in_million'] > (Q3 + 1.5 * IQR)]
print('No. of outliers greater than the upper fence: ', len(UOUT))
UOUT
UOUT.shape
C. 3. E. Check frequency of the OperatingState features classes. [1 Mark]
print (df['OperatingState'].value_counts())
Closed 106
Acquired 86
Ipo 5
Name: OperatingState, dtype: int64
counts = df['OperatingState'].value_counts().to_dict()
print (counts)
#Graphical Representation
plt.figure(figsize=(8,11))
sns.histplot(data=cleandata,x='OperatingState');
4. Statistical Analysis: [10 Marks]
A. Is there any significant difference between Funds raised by companies that are still operating vs companies that closed down? [1 Mark]
SOLUTION: DESIGNING NULL AND ALTERNATE HYPOTHESIS
H0: There is no statistically significant difference between funds raised by companies that are still operating and funds raised by companies that have closed down.
Ha: There is a statistically significant difference in funds raised by companies that are still operating vs those that have closed down.
D. Make a copy of the original data frame. [1 Mark]
newdata = cleandata.copy()
newdata
events = cleandata[cleandata["Event"].str.contains('Disrupt')].copy()
events[['EventName','EventLocation','EventYear']] = events['Event'].str.split(' ',expand=True)
events['EventYear'] = events['EventYear'].replace('-', np.nan)
events["EventYear"] = pd.to_numeric(events["EventYear"])
events = events[events["EventYear"] >= 2013]
events
op_comp = df1[df1['OperatingState'] == "Operating"]
cl_comp = df1[df1['OperatingState'] == "Closed"]
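With the two groups separated, the hypothesis in 4.A can be tested with a two-sample t-test (`scipy.stats.ttest_ind`, Welch's variant). The arrays below are hypothetical placeholders; in the notebook they would be `op_comp['Funds_in_million']` and `cl_comp['Funds_in_million']`:

```python
import numpy as np
from scipy import stats

# Hypothetical funds (in millions) for illustration only
operating_funds = np.array([10.0, 25.0, 5.0, 60.0, 15.0, 8.0])
closed_funds = np.array([2.0, 1.5, 6.0, 0.5, 3.0])

# Welch's t-test: does not assume equal variances between the groups
t_stat, p_value = stats.ttest_ind(operating_funds, closed_funds, equal_var=False)
print('t = %.3f, p = %.4f' % (t_stat, p_value))
if p_value < 0.05:
    print('Reject H0: the difference in funds raised is statistically significant')
else:
    print('Fail to reject H0: no statistically significant difference')
```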